1. Intoduction

In this project, we are going to visualize the dataset 2016collisionsfinal.csv which represent collisions occurring on public roadways.

2.1 Load libraries

We import pandas to work with our data, Matplotlib to plot charts, and Seaborn to make our charts prettier.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go
color = sns.color_palette()
sns.set(style="darkgrid")
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, MinMaxScaler

2.2 Read the Data

Let's load the GlobalCitiesPBI.csv which has been provided in datasets for the course.

In [3]:
rawdf = pd.read_csv('2016collisionsfinal.csv')
rawdf.head()
Out[3]:
Record Location X Y Date Time Environment Road_Surface Traffic_Control Collision_Location Light Collision_Classification Impact_type
0 1 105 S OF COMMISSIONER ST @ ALBERT ST 366666.2747 5030879.415 8/4/16 17:03 01 - Clear 01 - Dry 01 - Traffic signal 02 - Intersection related 01 - Daylight 03 - P.D. only 04 - Sideswipe
1 2 210 W OF MERIVALE RD @ WEST HUNT CLUB RD 365431.439 5021830.23 3/30/16 15:44 01 - Clear 01 - Dry 01 - Traffic signal 03 - At intersection 01 - Daylight 02 - Non-fatal injury 05 - Turning movement
2 3 210 W OF MERIVALE RD @ WEST HUNT CLUB RD 365432.2342 5021829.655 9/2/16 16:02 01 - Clear 01 - Dry 01 - Traffic signal 03 - At intersection 01 - Daylight 03 - P.D. only 05 - Turning movement
3 4 210 W OF MERIVALE RD @ WEST HUNT CLUB RD 365430.7265 5021830.451 3/12/16 17:30 01 - Clear 01 - Dry 01 - Traffic signal 03 - At intersection 01 - Daylight 03 - P.D. only 05 - Turning movement
4 5 210 W OF MERIVALE RD @ WEST HUNT CLUB RD 365430.6658 5021830.963 8/23/16 15:52 01 - Clear 01 - Dry 01 - Traffic signal 02 - Intersection related 01 - Daylight 03 - P.D. only 03 - Rear end
In [4]:
# checking missing data 
total = rawdf.isnull().sum().sort_values(ascending = False)
percent = (rawdf.isnull().sum()/rawdf.isnull().count()).sort_values(ascending = False)
missing_data  = pd.concat([total, percent], axis=1, keys=['Total missing ', 'Percent'])
missing_data
Out[4]:
Total missing Percent
Collision_Location 6 0.000428
Traffic_Control 3 0.000214
Light 2 0.000143
Impact_type 0 0.000000
Collision_Classification 0 0.000000
Road_Surface 0 0.000000
Environment 0 0.000000
Time 0 0.000000
Date 0 0.000000
Y 0 0.000000
X 0 0.000000
Location 0 0.000000
Record 0 0.000000

Since the missing values are a very small percentage of data, we can safely drop rows with missing values, we also rename data frame to df:

In [5]:
df = rawdf.dropna(axis=0 , how='any')
df.head()
Out[5]:
Record Location X Y Date Time Environment Road_Surface Traffic_Control Collision_Location Light Collision_Classification Impact_type
0 1 105 S OF COMMISSIONER ST @ ALBERT ST 366666.2747 5030879.415 8/4/16 17:03 01 - Clear 01 - Dry 01 - Traffic signal 02 - Intersection related 01 - Daylight 03 - P.D. only 04 - Sideswipe
1 2 210 W OF MERIVALE RD @ WEST HUNT CLUB RD 365431.439 5021830.23 3/30/16 15:44 01 - Clear 01 - Dry 01 - Traffic signal 03 - At intersection 01 - Daylight 02 - Non-fatal injury 05 - Turning movement
2 3 210 W OF MERIVALE RD @ WEST HUNT CLUB RD 365432.2342 5021829.655 9/2/16 16:02 01 - Clear 01 - Dry 01 - Traffic signal 03 - At intersection 01 - Daylight 03 - P.D. only 05 - Turning movement
3 4 210 W OF MERIVALE RD @ WEST HUNT CLUB RD 365430.7265 5021830.451 3/12/16 17:30 01 - Clear 01 - Dry 01 - Traffic signal 03 - At intersection 01 - Daylight 03 - P.D. only 05 - Turning movement
4 5 210 W OF MERIVALE RD @ WEST HUNT CLUB RD 365430.6658 5021830.963 8/23/16 15:52 01 - Clear 01 - Dry 01 - Traffic signal 02 - Intersection related 01 - Daylight 03 - P.D. only 03 - Rear end
In [6]:
# comparing sizes of data frames 
print("Old data frame length:", len(rawdf), "\nNew data frame length:",  
       len(df), "\nNumber of rows with at least 1 NA value: ", 
       (len(rawdf)-len(df)))
Old data frame length: 14023 
New data frame length: 14012 
Number of rows with at least 1 NA value:  11
In [7]:
print('The dataset 2016collisionsfinal has {} rows and {} features'.format(df.shape[0],df.shape[1]))
The dataset 2016collisionsfinal has 14012 rows and 13 features

4. Question 1

we need to choose a list of variables (at most 7) that are crucial to a good understanding of the dataset:

  1. we gonna drop the first column is just the row index, and count for the record of collisions.

  2. we are going to remove the variables "Continent", "Country" because they are kind of redundant variable to geography

  3. the remaining 5 variables gonna be choosing as the top 5 correlated features with the output variable Sort_gen.rating

  4. Some other variables as "Life.Expectancy.in.Years..Female.", "Life.Expectancy.in.Years..Male." and "Life.Expectancy gonna be used in the analysis.

The 7 variables are :

  1. "Date"
  2. "Time"
  3. "Location"
  4. "Light"
  5. "X"
  6. "Y"
  7. "Collision_Classification"
  8. "Traffic_Control"
In [8]:
df = rawdf.dropna(axis=0 , how='any')

df = df[['Date', 'Time', 'Location', 'Light', 'X', 'Y', 'Impact_type','Traffic_Control',"Collision_Location",'Road_Surface']]
df['date_m'] =  df['Date'].str.split('/', expand=True)[[0]].astype(int)
df['time'] =  df['Time'].str.split(':', expand=True)[[0]].astype(int)
df['road_Surface'] =  df['Road_Surface'].str.split('-', expand=True)[[0]].astype(int)
df['light'] =  df['Light'].str.split('-', expand=True)[[0]].astype(int)
df['traffic_Control'] =  df['Traffic_Control'].str.split('-', expand=True)[[0]].astype(int)
df['X'] = df['X'].str.replace('\,', '').astype(float)
df['Y'] = df['Y'].str.replace('\,', '').astype(float)

df = df.drop(axis = 0, index = df[df.X > 1000000].index)
df = df.drop(axis = 0, index = df[df["light"] == 99].index)

df.head()
Out[8]:
Date Time Location Light X Y Impact_type Traffic_Control Collision_Location Road_Surface date_m time road_Surface light traffic_Control
0 8/4/16 17:03 105 S OF COMMISSIONER ST @ ALBERT ST 01 - Daylight 366666.2747 5030879.415 04 - Sideswipe 01 - Traffic signal 02 - Intersection related 01 - Dry 8 17 1 1 1
1 3/30/16 15:44 210 W OF MERIVALE RD @ WEST HUNT CLUB RD 01 - Daylight 365431.4390 5021830.230 05 - Turning movement 01 - Traffic signal 03 - At intersection 01 - Dry 3 15 1 1 1
2 9/2/16 16:02 210 W OF MERIVALE RD @ WEST HUNT CLUB RD 01 - Daylight 365432.2342 5021829.655 05 - Turning movement 01 - Traffic signal 03 - At intersection 01 - Dry 9 16 1 1 1
3 3/12/16 17:30 210 W OF MERIVALE RD @ WEST HUNT CLUB RD 01 - Daylight 365430.7265 5021830.451 05 - Turning movement 01 - Traffic signal 03 - At intersection 01 - Dry 3 17 1 1 1
4 8/23/16 15:52 210 W OF MERIVALE RD @ WEST HUNT CLUB RD 01 - Daylight 365430.6658 5021830.963 03 - Rear end 01 - Traffic signal 02 - Intersection related 01 - Dry 8 15 1 1 1

5. Question 2

In this question we gonna create 4 multivariate visualizations for the dataset using the variables listed in question 1.

In [9]:
from plotly import graph_objects as go

L = df["Collision_Location"].unique().tolist()
L_count = []
for st in L:
  x = df[df["Collision_Location"] == st].shape[0]
  L_count.append(x)
fig = go.Figure(go.Funnel(
    y = L,
    x = L_count))

fig.show()
In [13]:
plt.figure(figsize=(21,6))

L = df["date_m"].unique().tolist()
L_count = []
for st in L:
    x = df[df["date_m"] == st].shape[0]
    L_count.append(x)

fig = go.Figure(data=go.Scatter(x = L,
                                y = L_count,
                                mode='markers'
                                )) 

fig.update_layout(title='number of accidents per month')
fig.show()
<Figure size 1512x432 with 0 Axes>

Notes:

  1. This Scatter plots are similar to line graphs in which we are mapping Metro population vs infant mortality rate grouped by gen_rating.
In [14]:
import plotly.express as px
paral_data = df[['date_m','time','Light','Collision_Location','Traffic_Control']]
fig = px.parallel_categories(paral_data)

fig.show()

Notes:

In [15]:
plt.figure(figsize=(16,6))

import plotly.express as px
px.scatter(df, x='X', y='Y',
           color='Impact_type',
           width=900, height=400,
           title = "relationship X and Y and Impact_type ")
<Figure size 1152x432 with 0 Axes>

Notes:

6. Question 3

Question 3 : “definitive” visualizations for the dataset.

In [17]:
plt.figure(figsize=(16,6))
sorted_ = df.groupby(['Light'])['date_m'].mean()
sns.boxplot(x=df['date_m'], y=df['Light'], order=list(sorted_.index))
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a27b78c50>
In [18]:
import plotly.express as px
px.scatter(df, x='light', y='road_Surface',color='Collision_Location', width=900, height=400,
           title = "relationship between light  and road_Surface")

Notes:

  1. This plot illustrate well the trend between life expectancy, GDP and gen_rating. All of this variables are positively correlated.

  2. Jerusalem is detecte as outlier for which is the only city in the groupe of sifficency with high life expectancy and GDP.